Multicore  chips  as  commodity  architecture 
for  platforms  ranging  from  handhelds  to 
supercomputers  herald  an  era  when  parallel 
programming  and  computing  will  be  the  norm. 
While  the  computer  science  and  engineering 
community  has  periodically  focused  on  advancing 
the  technology  for  parallel  processing,8  this  time 
around  the  stakes  are  truly  high,  since  there  is 
no  obvious  route  to  higher  performance  other 
than  through  parallelism.  However,  for  parallel 
computing  to  become  widespread,  breakthroughs 
are  needed  in  all  layers  of  the  computing  stack, 
including  languages,  programming  models, 
compilation  and  runtime  software,  programming 
and  debugging  tools,  and  hardware  architectures. 

At  the  hardware-architecture  layer,  we  need  to 
change  the  way  multicore  architectures  are  designed. 


In the past, architectures were designed primarily for performance or for energy
efficiency. Looking ahead, one of the top priorities must be for the architecture to
enable a programmable environment. In practice, programmability is a notoriously
difficult metric to define and measure. At the hardware-architecture level,
programmability implies two things: First, the architecture is able to attain high
efficiency while relieving the programmer from having to manage low-level tasks;
second, the architecture helps minimize the chance of (parallel) programming errors.

In this article, we describe a novel, general-purpose multicore architecture, the
Bulk Multicore, which we designed to enable a highly programmable environment. In it,
the programmer and runtime system are relieved of having to manage the sharing of
data thanks to novel support for scalable hardware cache coherence. Moreover, to help
minimize the chance of parallel-programming errors, the Bulk Multicore provides
high-performance sequential memory consistency to the software and also introduces
several novel hardware primitives. These primitives can be used to build a
sophisticated program-development-and-debugging environment, including low-overhead
data-race detection, deterministic replay of parallel programs, and high-speed
disambiguation of sets of addresses. The primitives have an overhead low enough to
always be “on” during production runs.

The key idea in the Bulk Multicore is twofold: First, the hardware automatically
executes all software as a series of atomic blocks of thousands of dynamic
instructions called Chunks. Chunk execution is invisible to the software and,
therefore, puts no restriction on the programming language or model. Second, the Bulk
Multicore introduces the use of Hardware Address Signatures as a low-overhead
mechanism to ensure atomic and isolated execution of chunks and help maintain
hardware cache coherence.

The programmability advantages of the Bulk Multicore do not come at the expense of
performance. On the contrary, the Bulk Multicore enables high performance because the
processor hardware is free to aggressively reorder and overlap the memory accesses of
a program within chunks without risk of breaking their expected behavior in a
multiprocessor environment. Moreover, in an advanced Bulk Multicore design where the
compiler observes the chunks, the compiler can further improve performance by heavily
optimizing the instructions within each chunk. Finally, the Bulk Multicore
organization decreases hardware design complexity by freeing processor designers from
having to worry about many corner cases that appear when designing multiprocessors.

Architecture 

The Bulk Multicore architecture eliminates one of the traditional tenets of processor
architecture, namely the need to commit instructions in order, providing the
architectural state of the processor after every single instruction. Having to
provide such state in a multiprocessor environment (even if no other processor or
unit in the machine needs it) contributes to the complexity of current system
designs. This is because, in such an environment, memory-system accesses take many
cycles, and multiple loads and stores from both the same and different processors
overlap their execution.

In the Bulk Multicore, the default execution mode of a processor is to commit chunks
of instructions at a time.2 A chunk is a group of dynamically contiguous instructions
(for example, 2,000 instructions). Such a “chunked” mode of execution and commit is a
hardware-only mechanism, invisible to the software running on the processor.
Moreover, its purpose is not to parallelize a thread, since the chunks in a thread
are not distributed to other processors. Rather, the purpose is to improve
programmability and performance.

Each chunk executes on the processor atomically and in isolation. Atomic execution
means that none of the chunk’s actions are made visible to the rest of the system
(processors or main memory) until the chunk completes and commits. Execution in
isolation means that if the chunk reads a location and (before it commits) a second
chunk in another processor that has written to the location commits, then the local
chunk is squashed and must re-execute.

To execute chunks atomically and in isolation inexpensively, the Bulk Multicore
introduces hardware address signatures.3 A signature is a register of about 1,024
bits that accumulates hash-encoded addresses. Figure 1 outlines a simple way to
generate a signature (see the sidebar “Signatures and Signature Operations in
Hardware” for a deeper discussion). A signature, therefore, represents a set of
addresses.


Signatures and Signature Operations in Hardware

Figure 1 in the main text shows a simple implementation of a signature. The bits of an
incoming address go through a fixed permutation to reduce collisions and are then
separated into bit-fields Ci. Each field is decoded and accumulated into a bit-field Vj
in the signature. Much more sophisticated implementations are also possible.

A module called the Bulk Disambiguation Module contains several signature
registers and simple functional units that operate efficiently on signatures. These
functional units are invisible to the instruction-set architecture. Note that, given a
signature, we can recover only a superset of the addresses originally encoded into the
signature. Consequently, the operations on signatures produce conservative results.

The figure here outlines five signature functional units: intersection, union, test
for null signature, test for address membership, and decoding (δ). Intersection finds
the addresses common to two signatures by performing a bit-wise AND of the two
signatures. The resulting signature is empty if, as shown in the figure, any of its
bit-fields contains all zeros. Union finds all addresses present in at least one
signature through a bit-wise OR of the two signatures. Testing whether an address a is
present (conservatively) in a signature involves encoding a into a signature,
intersecting the latter with the original signature, and then testing the result for a
null signature.

Decoding (δ) a signature determines which cache sets can contain addresses
belonging to the signature. The set bitmask produced by this operation is then passed
to a finite-state machine that successively reads individual lines from the sets in the
bitmask and checks for membership in the signature. This process is used to identify
and invalidate all the addresses in a signature that are present in the cache.

Overall, the support described here enables low-overhead operations on sets of
addresses.3


Operations on signatures.

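The sidebar's encoding and four of its five operations (intersection, union, null test, and membership test; decoding into cache sets is omitted) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the 32-bit address width, the rotate-by-7 permutation, and the choice of four 5-bit fields are all hypothetical and are not taken from the article.

```python
from dataclasses import dataclass

ADDR_BITS = 32      # assumed address width
FIELD_BITS = 5      # assumed width of each address bit-field C_i
NUM_FIELDS = 4      # assumed number of signature bit-fields V_j
MASK = (1 << ADDR_BITS) - 1


def encode(addr: int) -> tuple:
    """Return the per-field bitmasks that encode a single address."""
    addr &= MASK
    # Fixed bit permutation (assumed here: rotate left by 7 bits).
    permuted = ((addr << 7) | (addr >> (ADDR_BITS - 7))) & MASK
    fields = []
    for i in range(NUM_FIELDS):
        c_i = (permuted >> (i * FIELD_BITS)) & ((1 << FIELD_BITS) - 1)
        fields.append(1 << c_i)  # decode C_i into a one-hot bit of field i
    return tuple(fields)


@dataclass(frozen=True)
class Signature:
    fields: tuple = (0,) * NUM_FIELDS

    def union(self, other):
        return Signature(tuple(a | b for a, b in zip(self.fields, other.fields)))

    def intersect(self, other):
        return Signature(tuple(a & b for a, b in zip(self.fields, other.fields)))

    def is_empty(self):
        # Null signature: at least one bit-field is all zeros.
        return any(f == 0 for f in self.fields)

    def insert(self, addr):
        return self.union(Signature(encode(addr)))

    def may_contain(self, addr):
        # Conservative: True may be a false positive, False is always exact.
        return not self.intersect(Signature(encode(addr))).is_empty()


sig = Signature().insert(0x1000).insert(0x2040)
```

Because several addresses can map to the same bits, `may_contain` can report an address that was never inserted, which is the conservative behavior the sidebar describes.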

In the Bulk Multicore, the hardware automatically accumulates the addresses read and
written by a chunk into a read (R) and a write (W) signature, respectively. These
signatures are kept in a module in the cache hierarchy. This module also includes
simple functional units that operate on signatures, performing such operations as
signature intersection (to find the addresses common to two signatures) and address
membership test (to find out whether an address belongs to a signature), as detailed
in the sidebar.

Atomic chunk execution is supported by buffering the state generated by the chunk in
the L1 cache. No update is propagated outside the cache while the chunk is executing.
When the chunk completes, or when a dirty cache line whose address is in the W
signature must be displaced from the cache, the hardware proceeds to commit the
chunk. A successful commit involves sending the chunk’s W signature to the subset of
sharer processors indicated by the directory2 and clearing the local R and W
signatures. The latter operation erases any record of the updates made by the chunk,
though the written lines remain dirty in the cache.

The W signature carries enough information to both invalidate stale lines from the
other coherent caches (using the δ signature operation on W, as discussed in the
sidebar) and enforce that all other processors execute their chunks in isolation.
Specifically, to enforce that a processor executes a chunk in isolation, when the
processor receives an incoming signature Winc, its hardware intersects Winc against
the local Rloc and Wloc signatures. If either of the two intersections is not null,
it means (conservatively) that the local chunk has accessed a data element written by
the committing chunk. Consequently, the local chunk is squashed and then restarted.

Figure 2 outlines atomic and isolated execution. Thread 0 executes a chunk that
writes variables B and C, and no invalidations are sent out. Signature W0 receives
the hashed addresses of B and C. At the same time, Thread 1 issues reads for B and C,
which (by construction) load the non-speculative values of the variables, namely the
values before Thread 0’s updates. When Thread 0’s chunk commits, the hardware sends
signature W0 to Thread 1, and W0 and R0 are cleared. At the processor where Thread 1
runs, the hardware intersects W0 with the ongoing chunk’s R1 and W1. Since W0 ∩ R1 is
not null, the chunk in Thread 1 is squashed.
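The Figure 2 scenario can be replayed with a small software model. The sketch below is an assumption-laden simplification: each signature is a single 1,024-bit mask, the hash simply uses the cache-line number (64-byte lines assumed), and the `Chunk` class and the addresses chosen for B, C, and D are hypothetical names for illustration only.

```python
SIG_BITS = 1024  # matches the approximate signature size in the text


def encode(addr: int) -> int:
    """Toy hash: map the 64-byte line number of addr to one signature bit."""
    return 1 << ((addr >> 6) % SIG_BITS)


class Chunk:
    def __init__(self):
        self.R = 0            # read signature
        self.W = 0            # write signature
        self.squashed = False

    def read(self, addr):
        self.R |= encode(addr)

    def write(self, addr):
        self.W |= encode(addr)

    def commit(self):
        """Return W for broadcast, then clear the local signatures."""
        w = self.W
        self.R = self.W = 0
        return w

    def receive(self, w_inc):
        """Squash if the incoming W overlaps the local R or W signature."""
        if (w_inc & self.R) or (w_inc & self.W):
            self.squashed = True
            self.R = self.W = 0   # the chunk restarts with empty signatures


B, C, D = 0x1000, 0x2000, 0x3000   # hypothetical addresses

t0, t1, t2 = Chunk(), Chunk(), Chunk()
t0.write(B); t0.write(C)           # Thread 0 writes B and C
t1.read(B); t1.read(C)             # Thread 1 reads the old values of B and C
t2.read(D)                         # an unrelated chunk, added for contrast

w0 = t0.commit()                   # Thread 0 commits and sends W0
t1.receive(w0)
t2.receive(w0)
```

After the commit, the chunk in Thread 1 is marked squashed because W0 overlaps its read signature, while the unrelated chunk continues.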

The commit of chunks is serialized globally. In a bus-based machine, serialization is
given by the order in which W signatures are placed on the bus. With a general
interconnect, serialization is enforced by a (potentially distributed) arbiter
module.2 W signatures are sent to the arbiter, which quickly acknowledges whether the
chunk can be considered committed.

Since chunks execute atomically and in isolation, commit in program order in each
processor, and there is a global commit order of chunks, the Bulk Multicore supports
sequential consistency (SC)9 at the chunk level. As a consequence, the machine also
supports SC at the instruction level. More important, it supports high-performance SC
at low hardware complexity.
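As a reminder of what SC guarantees at the instruction level, consider the classic store-buffering litmus test, which is not part of the article and is used here only as an illustration. Under SC every execution is some interleaving of the threads' program orders, so enumerating the interleavings gives every allowed outcome. The variable and register names below are hypothetical.

```python
from itertools import combinations

# Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
T0 = [("store", "x", 1), ("load", "y", "r0")]
T1 = [("store", "y", 1), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r0"], regs["r1"]


outcomes = {run(s) for s in interleavings(T0, T1)}
```

The resulting set never contains (0, 0): at least one thread must observe the other's store. Machines with weaker memory models can produce that outcome unless fences are inserted.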

The performance of this SC implementation is high because (within a chunk) the Bulk
Multicore allows memory access reordering and overlap and instruction optimization.
As we discuss later, synchronization instructions induce no reordering constraint
within a chunk.

Meanwhile, hardware-implementation complexity is low because memory-consistency
enforcement is largely decoupled from processor structures. In a conventional
processor that issues memory accesses out of order, supporting SC requires intrusive
processor modifications. For example, from the time the processor executes a load to
line L out of order until the load reaches its commit time, the hardware must check
for writes to L by other processors, in case an inconsistent state was observed. Such
checking typically requires sending, for each external coherence event, a signal up
the cache hierarchy. The signal snoops the load queue to check for an address match.
Additional modifications involve preventing cache displacements that could risk
missing a coherence event. Consequently, load queues, L1 caches, and other critical
processor components must be augmented with extra hardware.

Figure 1. A simple way to generate a signature.

Figure 2. Executing chunks atomically and in isolation with signatures.

In the Bulk Multicore, SC enforcement and violation detection are performed with
simple signature intersections outside the processor core. Additionally, caches are
oblivious to what data is speculative, and their tag and data arrays are unmodified.

Finally, note that the Bulk Multicore’s execution mode is not like transactional
memory.6 While one could intuitively view the Bulk Multicore as an environment with
transactions occurring all the time, the key difference is that chunks are dynamic
entities, rather than static, and invisible to the software.

High  Programmability 

Since chunked execution is invisible to the software, it places no restriction on
programming model, language, or runtime system. However, it does enable a highly
programmable environment by virtue of providing two features: high-performance SC at
the hardware level and several novel hardware primitives that can be used to build a
sophisticated program-development-and-debugging environment.

Unlike current architectures, the Bulk Multicore supports high-performance SC at the
hardware level. If we generate code for the Bulk Multicore using an SC compiler (such
as the BulkCompiler1), we attain a high-performance, fully SC platform. The resulting
platform is highly programmable for several reasons. The first is that debugging
concurrent programs with data races would be much easier. This is because the
possible outcomes of the memory accesses involved in the bug would be easier to
reason about, and the debugger would in fact be able to reproduce the buggy
interleaving. Second, most existing software correctness tools (such as Microsoft’s
CHESS14) assume SC. Verifying software correctness under SC is already difficult, and
the state space balloons if non-SC interleavings need to be verified as well. In the
next few years, we expect that correctness-verification tools will play a larger role
as more parallel software is developed. Using them in combination with an SC platform
would make them most effective.

A final reason for the programmability of an SC platform is that it would make the
memory model of safe languages (such as Java) easier to understand and verify. The
need to provide safety guarantees and enable performance at the same time has
resulted in an increasingly complex and unintuitive memory model over the years. A
high-performance SC memory model would trivially ensure Java’s safety properties
related to memory ordering, improving its security and usability.

The Bulk Multicore’s second feature is a set of hardware primitives that can be used
to engineer a sophisticated program-development-and-debugging environment that is
always “on,” even during production runs. The key insight is that chunks and
signatures free development and debugging tools from having to record or be concerned
with individual loads and stores. As a result, the amount of bookkeeping and state
required by the tools is substantially reduced, as is the time overhead. Here, we
give three examples of this benefit in the areas of deterministic replay of parallel
programs, data-race detection, and high-speed disambiguation of sets of addresses.

Note, too, that chunks provide an excellent primitive for supporting popular
atomic-section-based techniques for programmability (such as thread-level
speculation17 and transactional memory6).

Deterministic replay of parallel programs with practically no log. Hardware-assisted
deterministic replay of parallel programs is a promising technique for debugging
parallel programs. It involves a two-step process.20 In the recording step, while the
parallel program executes, special hardware records into a log the order of data
dependences observed among the multiple threads. The log effectively captures the
“interleaving” of the program’s threads. Then, in the replay step, while the parallel
program is re-executed, the system enforces the interleaving orders encoded in the
log.

In most proposals of deterministic replay schemes, the log stores individual data
dependences between threads or groups of dependences bundled together. In the Bulk
Multicore, the log must store only the total order of chunk commits, an approach we
call DeLorean.13 The logged information can be as minimal as a list of
committing-processor IDs, assuming the chunking is performed in a deterministic
manner so that the chunk sizes can be deterministically reproduced on replay. This
design, which we call OrderOnly, reduces the log size by nearly an order of magnitude
over previous proposals.

The Bulk Multicore can further reduce the log size if, during the recording step, the
arbiter enforces a certain order of chunk commit interleaving among the different
threads (such as by committing one chunk from each processor round robin). In this
case of enforced chunk-commit order, the log practically disappears. During the
replay step, the arbiter again enforces the original commit algorithm, forcing the
same order of chunk commits as in the recording step. This design, which we call
PicoLog, typically incurs a performance cost because it can force some processors to
wait during recording.

Figure 3a outlines a parallel execution in which the boxes are chunks and the arrows
are the observed cross-thread data dependences. Figure 3b shows a possible resulting
execution log in OrderOnly, while Figure 3c shows the log in PicoLog.
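The difference between the two logging schemes can be sketched as follows. This is a simplified model, not the DeLorean hardware: commit events are assumed to be (processor ID, per-processor chunk index) pairs listed in global commit order, chunking is assumed to be deterministic, and all function names are hypothetical.

```python
def record_order_only(commit_events):
    """OrderOnly: log only the ID of the processor that commits each chunk."""
    return [proc for proc, _chunk_index in commit_events]


def replay_order_only(log):
    """Rebuild the global commit order from the processor-ID log alone."""
    next_chunk = {}            # next chunk index to commit, per processor
    order = []
    for proc in log:
        k = next_chunk.get(proc, 0)
        order.append((proc, k))
        next_chunk[proc] = k + 1
    return order


def picolog_order(num_procs, chunks_per_proc):
    """PicoLog: the arbiter commits one chunk per processor, round robin.

    The order is fixed by the commit policy, so no per-commit log entry
    is needed to reproduce it on replay.
    """
    return [(p, k) for k in range(chunks_per_proc) for p in range(num_procs)]


# A recorded run on three processors (made-up commit order).
events = [(0, 0), (1, 0), (1, 1), (0, 1), (2, 0)]
log = record_order_only(events)
replayed = replay_order_only(log)
```

In the OrderOnly model the log holds one small entry per chunk commit, while the PicoLog order is derived from the policy itself, which mirrors why its log "practically disappears" at the cost of making some processors wait.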

Data-race detection at production-run speed. The Bulk Multicore can support an
efficient data-race detector based on the “happens-before” method10 if it cuts the
chunks at synchronization points, rather than at arbitrary dynamic points.
Synchronization points are easily recognized by hardware or software, since
synchronization operations are executed by special instructions. This approach is
described in ReEnact16; Figure 4 includes examples with a lock, flag, and barrier.

Each chunk is given a counter value called ChunkID following the happens-before
ordering. Specifically, chunks in a given thread receive ChunkIDs that increase in
program order. Moreover, a synchronization between two threads orders the ChunkIDs of
the chunks involved in the synchronization. For example, in Figure 4a, the chunk in
Thread 2 following the lock acquire (Chunk 5) sets its ChunkID to be a successor of
both the previous chunk in Thread 2 (Chunk 4) and the chunk in Thread 1 that released
the lock (Chunk 2). For the other synchronization primitives, the algorithm is
similar. For example, for the barrier in Figure 4c, each chunk immediately following
the barrier is given a ChunkID that makes it a successor of all the chunks leading to
the barrier.

Using ChunkIDs, we’ve given a partial ordering to the chunks. For example, in Figure
4a, Chunks 1 and 6 are ordered, but Chunks 3 and 4 are not. Such ordering helps
detect data races that occur in a particular execution. Specifically, when two chunks
from different threads are found to have a data dependence at runtime, their two
ChunkIDs are compared. If the ChunkIDs are ordered, this is not a data race because
there is an intervening synchronization between the chunks. Otherwise, a data race
has been found.

A simple way to determine when two chunks have a data dependence is to use the Bulk
Multicore signatures to tell when the data footprints of two chunks overlap. This
operation, together with the comparison and maintenance of ChunkIDs, can be done with
low overhead with hardware support. Consequently, the Bulk Multicore can detect data
races without significantly slowing the program, making it ideal for debugging
production runs.
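One way to model the ChunkID comparison in software is with vector clocks. That encoding is an assumption made for this sketch; the article says only that ChunkIDs follow the happens-before order. Exact Python sets stand in for the R and W signatures, and the chunk layout is one arrangement consistent with the description of Figure 4a (Chunks 1 to 3 in Thread 1, Chunks 4 to 6 in Thread 2, with the lock released at the end of Chunk 2).

```python
def successor(thread, *preds, num_threads=2):
    """ChunkID for a new chunk in `thread` that follows every chunk in `preds`."""
    vc = [max((p[i] for p in preds), default=0) for i in range(num_threads)]
    vc[thread] += 1
    return tuple(vc)


def ordered(a, b):
    """True if one ChunkID happens-before the other."""
    def leq(x, y):
        return all(i <= j for i, j in zip(x, y))
    return a != b and (leq(a, b) or leq(b, a))


def conflict(fp_a, fp_b):
    """Footprints are (reads, writes) pairs; a dependence needs at least one write."""
    (ra, wa), (rb, wb) = fp_a, fp_b
    return bool(wa & (rb | wb)) or bool(wb & ra)


def is_race(id_a, fp_a, id_b, fp_b):
    return conflict(fp_a, fp_b) and not ordered(id_a, id_b)


# Thread 1 is index 0, Thread 2 is index 1.
c1 = successor(0)
c2 = successor(0, c1)          # ends with the lock release
c3 = successor(0, c2)
c4 = successor(1)
c5 = successor(1, c4, c2)      # follows the lock acquire
c6 = successor(1, c5)

X = 0x100                      # hypothetical shared address
writes_x = (set(), {X})
reads_x = ({X}, set())
```

With this model, Chunks 1 and 6 are ordered through the lock, so a conflicting access pair between them is not a race, while the same pair between Chunks 3 and 4 is flagged.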

Enhancing programmability by making signatures visible to software. Finally, a
technique that improves programmability further is to make additional signatures
visible to the software. This support enables inexpensive monitoring of memory
accesses, as well as novel compiler optimizations that require dynamic disambiguation
of sets of addresses (see the sidebar “Making Signatures Visible to Software”).


Making Signatures Visible to Software

We propose that the software interact with some additional signatures through three
main primitives:18

The first is to explicitly encode into a signature either one address (Figure 1a) or all
addresses accessed in a code region (Figure 1b). The latter is enabled by the bcollect
(begin collect) and ecollect (end collect) instructions, which can be set to collect only
reads, only writes, or both.

The second primitive is to disambiguate the addresses accessed by the processor
in a code region against a given signature. It is enabled by the bdisamb.loc (begin
disambiguate local) and edisamb.loc (end disambiguate local) instructions (Figure 1c),
and can disambiguate reads, writes, or both.

The third primitive is to disambiguate the addresses of incoming coherence
messages (invalidations or downgrades) against a given local signature. It is enabled
by the bdisamb.rem (begin disambiguate remote) and edisamb.rem (end disambiguate
remote) instructions (Figure 1d) and can disambiguate reads, writes, or both. When
disambiguation finds a match, the system can deliver an interrupt or set a bit.

Figure 2 includes three examples of what can be done with these primitives: Figure
2a shows how the machine inexpensively supports many watchpoints. The processor
encodes into signature Sig2 the address of variable y and all the addresses accessed in
function foo(). It then watches all these addresses by executing bdisamb.loc on Sig2.

Figure 2b shows how a second call to a function that reads and writes memory in
its body can be skipped. In the figure, the code calls function foo() twice with the same
input value of x. To see if the second call can be skipped, the program first collects
all addresses accessed by foo() in Sig2. It then disambiguates all subsequent accesses
against Sig2. When execution reaches the second call to foo(), it can skip the call if two
conditions hold: the first is that the disambiguation did not find a conflict; the second
(not shown in the figure) is that the read and write footprints of the first foo() call do not
overlap. This possible overlap is checked by separately collecting the addresses read
in foo() and those written in foo() in separate signatures and intersecting the resulting
signatures.

Finally, Figure 2c shows a way to detect data dependences between threads running
on different processors. In the figure, collect encodes all addresses accessed in a
code section into Sig2. Surrounding the collect instructions, the code places
disamb.rem instructions to monitor whether any remotely initiated coherence action
conflicts with addresses accessed locally. To disregard read-read conflicts, the
programmer can collect the reads in a separate signature and perform remote
disambiguation of only writes against that signature.


Figure 1. Primitives enabling software to interact with additional signatures:
collection (a and b), local disambiguation (c), and remote disambiguation (d).

Figure 2. Using signatures to support data watchpoints (a), skip execution of
functions (b), and detect data dependencies between threads running on
different processors (c).
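The watchpoint use of the sidebar's primitives can be mimicked with a toy software model. Everything below is hypothetical scaffolding: the `Machine` class, its method names (chosen to echo bcollect, ecollect, bdisamb.loc, and edisamb.loc), and the addresses are illustrative only, and an exact Python set stands in for a hardware signature, which would instead be conservative.

```python
class Machine:
    """Toy memory system whose loads and stores feed signature primitives."""

    def __init__(self):
        self.mem = {}
        self.collecting = None   # signature being filled (bcollect ... ecollect)
        self.watching = None     # signature being checked (bdisamb.loc ... edisamb.loc)
        self.conflict = False    # set when a watched address is accessed

    def bcollect(self, sig):
        self.collecting = sig

    def ecollect(self):
        self.collecting = None

    def bdisamb_loc(self, sig):
        self.watching, self.conflict = sig, False

    def edisamb_loc(self):
        self.watching = None

    def _access(self, addr):
        if self.collecting is not None:
            self.collecting.add(addr)
        if self.watching is not None and addr in self.watching:
            self.conflict = True   # real hardware could raise an interrupt here

    def load(self, addr):
        self._access(addr)
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        self._access(addr)
        self.mem[addr] = value


def foo(m):
    """Stand-in for the watched function: touches two addresses."""
    m.store(0x104, m.load(0x100) + 1)


m = Machine()
Y = 0x40                 # hypothetical address of variable y
sig2 = {Y}               # encode one address into the signature
m.bcollect(sig2)         # then add every address foo() accesses
foo(m)
m.ecollect()

m.bdisamb_loc(sig2)      # watch all addresses in sig2
m.store(0x200, 1)        # unrelated store: no watchpoint hit
hit_before = m.conflict
m.store(Y, 7)            # store to y: watchpoint hit
hit_after = m.conflict
m.edisamb_loc()
```

A single signature covers the address of y plus the whole footprint of `foo`, which is the sense in which many watchpoints come at the cost of one disambiguation.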

Reduced Implementation Complexity

The Bulk Multicore also has advantages in performance and in hardware simplicity. It
delivers high performance because the processor hardware can reorder and overlap all
memory accesses within a chunk (except, of course, those that participate in
single-thread dependences). In particular, in the Bulk Multicore, synchronization
instructions do not constrain memory access reordering or overlap. Indeed, fences
inside a chunk are transformed into null instructions. The traditional role of a
fence, delaying execution until certain references have been performed, is
unnecessary here; by construction, no other processor observes the actual order of
instruction execution within a chunk.

Moreover, a processor can concurrently execute multiple chunks from the same thread,
and memory accesses from these chunks can also overlap. Each concurrently executing
chunk in the processor has its own R and W signatures, and individual accesses update
the corresponding chunk’s signatures. As long as chunks within a processor commit in
program order (if a chunk is squashed, its successors are also squashed), correctness
is guaranteed. Such concurrent chunk execution in a processor hides the chunk-commit
overhead.

Bulk Multicore performance increases further if the compiler generates the chunks, as
in the BulkCompiler.1 In this case, the compiler can aggressively optimize the code
within each chunk, recognizing that no other processor sees intermediate states
within a chunk.

Finally, the Bulk Multicore needs simpler processor hardware than current machines.
As discussed earlier, much of the responsibility for memory-consistency enforcement
is taken away from critical structures in the core (such as the load queue and L1
cache) and moved to the cache hierarchy, where signatures detect violations of SC.2
For example, this property could enable a new environment in which cores and
accelerators are designed without concern for how to satisfy a particular set of
access-ordering constraints. This ability allows hardware designers to focus on the
novel aspects of their design, rather than on the interaction with the target
machine’s legacy memory-consistency model. It also motivates the development of
commodity accelerators.

Figure 3. Parallel execution in the Bulk Multicore (a), with a possible
OrderOnly execution log (b) and PicoLog execution log (c).

Figure 4. Forming chunks for data-race detection in the presence
of a lock (a), flag (b), and barrier (c).

Related  Work 

Numerous proposals for multiprocessor architecture designs focus on improving
programmability. In particular, architectures for thread-level speculation (TLS)17
and transactional memory (TM)6 have received significant attention over the past 15
years. These techniques share key primitive mechanisms with the Bulk Multicore,
notably speculative state buffering and undo and detection of cross-thread conflicts.
However, they also have a different goal, namely to simplify code parallelization,
either by parallelizing the code transparently to the user software (in TLS) or by
annotating the user code with constructs for mutual exclusion (in TM). The Bulk
Multicore, on the other hand, aims to provide a broadly usable architectural platform
that is easier to program for while delivering advantages in performance and hardware
simplicity.

Two architecture proposals involve processors continuously executing blocks of
instructions atomically and in isolation. One of them, called Transactional Memory
Coherence and Consistency (TCC),5 is a TM environment with transactions occurring all
the time. TCC mainly differs from the Bulk Multicore in that its transactions are
statically specified in the code, while chunks are created dynamically by the
hardware. The second proposal, called Implicit Transactions,19 is a multiprocessor
environment with checkpointed processors that regularly take checkpoints. The
instructions executed between checkpoints constitute the equivalent of a chunk. No
detailed implementation of the scheme is presented.

Automatic Mutual Exclusion (AME) is a programming model in which a program is written
as a group of atomic fragments that serialize in some manner. As in TCC, atomic
sections in AME are statically specified in the code, while the Bulk Multicore chunks
are hardware-generated dynamic entities.

The signature hardware we’ve introduced here has been adapted for use in TM (such as
in transaction-footprint collection and in address disambiguation12,21).

Several proposals implement data-race detection, deterministic replay of
multiprocessor programs, and other debugging techniques discussed here without
operating in chunks.4,11,15,20 Comparing their operation to chunk operation is the
subject of future work.

Future  Directions 

The Bulk Multicore architecture is a novel approach to building shared-memory
multiprocessors, where the whole execution operates in atomic chunks of instructions.
This approach can enable significant improvements in the productivity of parallel
programmers while imposing no restriction on the programming model or language used.

At the architecture level, we are examining the scalability of this organization.
While chunk commit requires arbitration in a (potentially distributed) arbiter, the
operation in chunks is inherently latency tolerant. At the programming level, we are
examining how chunk operation enables efficient support for new program-development
and debugging tools, aggressive autotuners and compilers, and even novel programming
models.